Getting started#
The xdatasets library enables users to effortlessly access a vast collection of earth observation datasets that are compatible with xarray formats.
The library adopts an opinionated approach to data querying and caters to the specific needs of certain user groups, such as hydrologists, climate scientists, and engineers. One of the functionalities of xdatasets is the ability to extract data at a specific location or within a designated region, such as a watershed or municipality, while also enabling spatial and temporal operations.
To use xdatasets, users must employ a query. For instance, a straightforward query to extract the variables t2m (2m temperature) and tp (Total precipitation) from the era5_reanalysis_single_levels dataset at two geographical positions (Montreal and Toronto) could be as follows:
query = {
    "datasets": {"era5_reanalysis_single_levels_dev": {'variables': ["t2m", "tp"]}},
    "space": {
        "clip": "point",  # bbox, point or polygon
        "geometry": {'Montreal': (45.508888, -73.561668),
                     'Toronto': (43.651070, -79.347015)}
    }
}
An example of a more complex query would look like the one below.
Note
Don’t worry! Below, you’ll find additional examples that will assist in understanding each parameter in the query, as well as the possible combinations.
This query requests the same variables as above. However, instead of specifying geographical positions, a geopandas.GeoDataFrame is used to provide features (such as shapefiles or geojson) within which the data is extracted. Each polygon is identified by the unique identifier Station, and a spatial average is computed within each one (aggregation: True). The dataset, initially at an hourly time step, is converted to a daily time step while applying one or more temporal aggregations for each variable, as prescribed in the query. xdatasets ultimately returns the dataset for the specified date range and time zone.
query = {
    "datasets": {"era5_reanalysis_single_levels_dev": {'variables': ["t2m", "tp"]}},
    "space": {
        "clip": "polygon",      # bbox, point or polygon
        "aggregation": True,    # spatial average of the variables within each polygon
        "geometry": gdf,
        "unique_id": "Station"  # unique column name in the geodataframe
    },
    "time": {
        "timestep": "D",
        "aggregation": {"tp": np.nansum,
                        "t2m": [np.nanmax, np.nanmin]},
        "start": '2000-01-01',
        "end": '2020-05-31',
        "timezone": 'America/Montreal',
    },
}
Query climate datasets#
In order to use xdatasets, you must import at least xdatasets, pandas, geopandas, and numpy. Additionally, we import hvplot and panel for visualization.
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import xdatasets as xd
import geopandas as gpd
import pandas as pd
import numpy as np
# Visualization
import hvplot.xarray
import panel as pn
from pathlib import Path
Clip by points (sites)#
To begin with, we need to create a dictionary of sites and their corresponding geographical coordinates.
[3]:
sites = {
    'Montreal': (45.508888, -73.561668),
    'New York': (40.730610, -73.935242),
    'Miami': (25.761681, -80.191788)
}
We will then extract the tp (total precipitation) and t2m (2m temperature) from the era5_reanalysis_single_levels dataset for the designated sites. Afterward, we will convert the time step to daily and adjust the timezone to Eastern Time. Finally, we will limit the temporal interval.
Before proceeding with this first query, let’s quickly outline the role of each parameter:
datasets: A dictionary where datasets serve as keys and desired variables as values.
space: A dictionary that defines the necessary spatial operations to apply on user-supplied geographic features.
time: A dictionary that defines the necessary temporal operations to apply on the datasets
For more information on each parameter, consult the API documentation.
This is what the requested query looks like:
[4]:
%%time
query = {
    "datasets": 'era5_reanalysis_single_levels',
    "space": {
        "clip": "point",  # bbox, point or polygon
        "geometry": sites
    },
    "time": {
        "timestep": "D",  # pandas offset aliases: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
        "aggregation": {"tp": np.nansum,
                        "t2m": np.nanmean},
        "start": '1959-01-01',
        "timezone": 'America/Montreal',
    },
}
xds = xd.Query(**query)
Temporal operations: processing tp with era5_reanalysis_single_levels: 100%|██████████| 2/2 [00:09<00:00, 4.52s/it]
CPU times: user 1min 8s, sys: 8.54 s, total: 1min 17s
Wall time: 2min 25s
By accessing the data attribute, you can view the data obtained from the query. It’s worth noting that the variable name tp has been updated to tp_nansum to reflect the reduction operation (np.nansum) that was utilized to convert the time step from hourly to daily. Likewise, t2m was updated to t2m_nanmean.
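As a rough sketch, assuming the new names simply concatenate the variable name with the NumPy aggregation function's `__name__` (which matches the output above), the renaming can be reproduced as follows (illustrative, not xdatasets' internal code):

```python
import numpy as np

# Hypothetical reconstruction of the naming convention observed above:
# "<variable>_<numpy aggregation function name>".
aggregations = {"tp": np.nansum, "t2m": np.nanmean}
renamed = {var: f"{var}_{func.__name__}" for var, func in aggregations.items()}
print(renamed)  # {'tp': 'tp_nansum', 't2m': 't2m_nanmean'}
```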
[5]:
xds.data
[5]:
<xarray.Dataset>
Dimensions: (time_agg: 2, spatial_agg: 1, timestep: 1, site: 3,
time: 23491, source: 1)
Coordinates:
* time_agg (time_agg) object 'nanmean' 'nansum'
* spatial_agg (spatial_agg) object 'point'
* timestep (timestep) object 'D'
* site (site) <U8 'Montreal' 'New York' 'Miami'
* time (time) datetime64[ns] 1959-01-01 1959-01-02 ... 2023-04-25
latitude (site) float64 45.5 40.75 25.75
longitude (site) float64 -73.5 -74.0 -80.25
* source (source) <U29 'era5_reanalysis_single_levels'
Data variables:
t2m (time_agg, spatial_agg, timestep, time, site, source) float32 ...
tp (time_agg, spatial_agg, timestep, time, site, source) float32 ...
Attributes: (12/31)
GRIB_NV: 0
GRIB_Nx: 1440
GRIB_Ny: 721
GRIB_cfName: unknown
GRIB_cfVarName: t2m
GRIB_dataType: an
... ...
GRIB_typeOfLevel: surface
GRIB_units: K
coordinates: number time step surface latitu...
long_name: 2 metre temperature
standard_name: unknown
    units:                K
[6]:
title = f"Comparison of total precipitation across three cities in North America from \
{xds.data.time.dt.year.min().values} to {xds.data.time.dt.year.max().values}"

xds.data \
    .sel(
        timestep='D',
        spatial_agg='point',
        time_agg='nansum',
        source='era5_reanalysis_single_levels') \
    .hvplot(
        title=title,
        x="time",
        y="tp",
        grid=True,
        width=750,
        height=450,
        by="site",
        legend="top",
        widget_location="bottom")
[6]:
[7]:
title = f"Comparison of 2m temperature across three cities in North America from \
{xds.data.time.dt.year.min().values} to {xds.data.time.dt.year.max().values}"

xds.data \
    .sel(
        timestep='D',
        spatial_agg='point',
        time_agg='nanmean',
        source='era5_reanalysis_single_levels') \
    .hvplot(
        title=title,
        x="time",
        y="t2m",
        grid=True,
        width=750,
        height=450,
        by="site",
        legend="top",
        widget_location="bottom")
[7]:
Clip on polygons with no averaging in space#
Let’s first access certain polygon features, which can be in the form of shapefiles, geojson, or any other format compatible with geopandas. In this example, we are using JSON (geojson) files.
[8]:
bucket = 'https://s3.us-east-2.wasabisys.com/watersheds-polygons/MELCC/json'
# Plain strings are used for the URLs: pathlib.Path would collapse the
# double slash in 'https://' and break them.
paths = [f'{bucket}/023003/023003.json',
         f'{bucket}/031101/031101.json',
         f'{bucket}/040111/040111.json']
Subsequently, all of the files can be opened and consolidated into a single geopandas.GeoDataFrame object.
[9]:
gdf = pd.concat([gpd.read_file(path) for path in paths]).reset_index(drop=True)
gdf
[9]:
|   | Station | Superficie | geometry |
|---|---------|------------|----------|
| 0 | 023003 | 208.4591919813271 | POLYGON ((-70.82601 46.81658, -70.82728 46.815... |
| 1 | 031101 | 111.7131058782722 | POLYGON ((-73.98519 45.21072, -73.98795 45.209... |
| 2 | 040111 | 433.440893903503 | POLYGON ((-74.06645 46.02253, -74.06647 46.022... |
Let’s examine the geographic locations of the polygon features.
[10]:
gdf.hvplot(geo=True,
           tiles='ESRI',
           color='Station',
           alpha=0.8,
           width=750,
           height=450,
           legend='top',
           hover_cols=['Station', 'Superficie'])
[10]:
The following query requests the variables t2m and tp from the era5_reanalysis_single_levels dataset, covering the period between January 1, 1959, and August 31, 1963, for the three polygons mentioned earlier. It is important to note that because averaging is set to False, no spatial averaging will be conducted; instead, a mask (raster) will be returned for each polygon.
[11]:
query = {
    "datasets": {"era5_reanalysis_single_levels": {'variables': ["t2m", "tp"]}},
    "space": {
        "clip": "polygon",   # bbox, point or polygon
        "averaging": False,  # no spatial averaging; a mask (raster) is returned per polygon
        "geometry": gdf,
        "unique_id": "Station"  # unique column name in the geodataframe
    },
    "time": {
        "start": '1959-01-01',
        "end": '1963-08-31',
    },
}
xds = xd.Query(**query)
Spatial operations: processing polygon 040111 with era5_reanalysis_single_levels: : 3it [00:00, 4.75it/s]
By accessing the data attribute, you can view the data obtained from the query. For each variable, the dimensions of time, latitude, longitude, and Station (the unique ID) are included. In addition, there is another variable called weights that is returned. This variable specifies the weight that should be assigned to each pixel if spatial averaging is conducted over a mask (polygon).
[12]:
xds.data
[12]:
<xarray.Dataset>
Dimensions: (latitude: 6, longitude: 5, time: 40896, Station: 3, source: 1)
Coordinates:
* latitude (latitude) float64 45.0 45.25 46.0 46.25 46.5 46.75
* longitude (longitude) float64 -74.5 -74.25 -74.0 -71.0 -70.75
* time (time) datetime64[ns] 1959-01-01 ... 1963-08-31T23:00:00
* Station (Station) object '023003' '031101' '040111'
* source (source) <U29 'era5_reanalysis_single_levels'
Data variables:
t2m (Station, time, latitude, longitude, source) float32 nan ... nan
tp (Station, time, latitude, longitude, source) float32 nan ... nan
weights (Station, latitude, longitude, source) float64 nan nan ... nan
Attributes:
Conventions: CF-1.6
history: 2022-11-10 02:03:41 GMT by grib_to_netcdf-2.25...
pangeo-forge:inputs_hash: 9423abd3198a0f0de3aa8368c73629d26c8207570bf035...
pangeo-forge:recipe_hash: 4b4eead2724b1cf53d6ddbf112d17dbbc49ecc83610af6...
    pangeo-forge:version:      0.9.4
Weights are much easier to comprehend visually, so let's examine the weights returned for station 023003. Notice that when selecting a single feature (Station 023003 in this case), the shape of our spatial dimensions is reduced to a 2x2 pixel area (longitude x latitude) that encompasses the entire feature.
[13]:
station = '023003'
ds_station = xds.data.sel(Station=station)
ds_clipped = xds.bbox_clip(ds_station).squeeze()
ds_clipped
[13]:
<xarray.Dataset>
Dimensions: (time: 40896, latitude: 2, longitude: 2)
Coordinates:
* latitude (latitude) float64 46.5 46.75
* longitude (longitude) float64 -71.0 -70.75
* time (time) datetime64[ns] 1959-01-01 ... 1963-08-31T23:00:00
Station <U6 '023003'
source <U29 'era5_reanalysis_single_levels'
Data variables:
t2m (time, latitude, longitude) float32 254.7 254.8 ... 287.6 287.5
tp (time, latitude, longitude) float32 nan nan ... 8.583e-06
weights (latitude, longitude) float64 1.38e-05 0.0001535 0.9079 0.09194
Attributes:
Conventions: CF-1.6
history: 2022-11-10 02:03:41 GMT by grib_to_netcdf-2.25...
pangeo-forge:inputs_hash: 9423abd3198a0f0de3aa8368c73629d26c8207570bf035...
pangeo-forge:recipe_hash: 4b4eead2724b1cf53d6ddbf112d17dbbc49ecc83610af6...
    pangeo-forge:version:      0.9.4
[14]:
(
(
ds_clipped.t2m.isel(time=0).hvplot(
title="The 2m temperature for pixels that intersect with the polygon on January 1, 1959",
tiles="ESRI",
geo=True,
alpha=0.6,
colormap="isolum",
width=750,
height=450,
)
* gdf[gdf.Station == station].hvplot(
geo=True,
width=750,
height=450,
legend="top",
hover_cols=["Station", "Superficie"],
)
)
+ ds_clipped.weights.hvplot(
title="The weights that should be assigned to each pixel when performing spatial averaging",
tiles="ESRI",
alpha=0.6,
colormap="isolum",
geo=True,
width=750,
height=450,
)
* gdf[gdf.Station == station].hvplot(
geo=True,
width=750,
height=450,
legend="top",
hover_cols=["Station", "Superficie"],
)
).cols(1)
[14]:
The two plots above show the 2m temperature for each pixel that intersects with the polygon from Station 023003, along with the corresponding weight to be applied to each pixel. In the lower plot, it is apparent that the majority of the polygon is situated in the upper-left pixel, which gives that pixel a weight of approximately 91%. The two lower pixels have very minimal intersection with the polygon, so their respective weights are nearly zero (hover over the plot to verify the weights).
In various libraries, either all pixels that intersect with the geometries are kept, or only pixels with centers within the polygon are retained. However, as shown in the previous example, utilizing such methods can introduce significant biases in the final outcome. Indeed, keeping all four pixels intersecting with the polygon with equal weights when the temperature values in the lower pixels are roughly 2 degrees lower than that of the upper-left pixel would introduce significant biases. Therefore, utilizing weights is a more precise approach.
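To make the point concrete, here is a minimal numpy sketch of such a weighted spatial average (illustrative only, not xdatasets' internal code; the weights are those shown above, while the temperature values are hypothetical):

```python
import numpy as np

# 2x2 pixel grid covering the polygon (hypothetical temperatures, in K)
t2m = np.array([[254.7, 254.8],
                [252.9, 252.8]])

# Fraction of the polygon falling in each pixel (from the weights plot above)
weights = np.array([[1.38e-05, 1.535e-04],
                    [9.079e-01, 9.194e-02]])

# Weighted average: each pixel contributes in proportion to its overlap
weighted_mean = np.nansum(t2m * weights) / np.nansum(weights)

# An unweighted mean over all four pixels would differ noticeably,
# since the lightly-overlapping pixels are roughly 2 K warmer
unweighted_mean = np.nanmean(t2m)
```

Because the upper-left pixel carries ~91% of the weight, the weighted mean stays close to its value, whereas the unweighted mean is pulled toward the warmer pixels that barely touch the polygon.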
Clip on polygons with averaging in space#
The following query requests the variables t2m and tp from the era5_reanalysis_single_levels and era5_land_reanalysis_dev datasets, covering the period from January 1, 1950, to the present, for the three polygons mentioned earlier. Note that when the averaging parameter is set to True, spatial averaging takes place. In addition, the weighted mask (raster) described earlier will be applied to generate a time series for each polygon.
Additional steps are carried out in the process, including converting the original hourly time step to a daily time step. During this conversion, various temporal aggregations will be applied to each variable and a conversion to the local time zone will take place.
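Conceptually, this hourly-to-daily conversion resembles the following pandas sketch (a simplified illustration with made-up values, not xdatasets' internal code):

```python
import numpy as np
import pandas as pd

# Two days of hypothetical hourly data, stored in UTC
rng = pd.date_range("2000-01-01", periods=48, freq="H", tz="UTC")
df = pd.DataFrame({"t2m": np.linspace(270.0, 280.0, 48), "tp": 0.001}, index=rng)

# Convert to the local time zone first, then resample to daily and aggregate;
# each variable can receive one or several aggregation functions
daily = (
    df.tz_convert("America/Montreal")
      .resample("D")
      .agg({"tp": np.nansum, "t2m": [np.nanmax, np.nanmin]})
)
```

Note that converting the time zone before resampling shifts the day boundaries: the 48 UTC hours span three local calendar days once converted to America/Montreal.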
Note
If users opt to pass multiple dictionaries instead of a single large one, the following format is also considered acceptable.
[15]:
%%time
datasets = {
    "era5_reanalysis_single_levels": {'variables': ["t2m", "tp"]},
    "era5_land_reanalysis_dev": {'variables': ["t2m", "tp"]}
}

space = {
    "clip": "polygon",  # bbox, point or polygon
    "averaging": True,  # spatial average of the variables within each polygon
    "geometry": gdf,    # 3 polygons
    "unique_id": "Station"
}

time = {
    "timestep": "D",
    "aggregation": {"tp": [np.nansum],
                    "t2m": [np.nanmax, np.nanmin]},
    "start": '1950-01-01',
    "timezone": 'America/Montreal',
}
xds = xd.Query(datasets=datasets,
space=space,
time=time)
/home/runner/work/xdatasets/xdatasets/xdatasets/workflows.py:62: UserWarning: "start_date" not found within input date time range. Defaulting to minimum time step in xarray object.
ds = subset_time(ds, start_date=start_time, end_date=end_time)
Spatial operations: processing polygon 040111 with era5_reanalysis_single_levels: : 3it [00:00, 4.25it/s]
Temporal operations: processing tp with era5_reanalysis_single_levels: 100%|██████████| 2/2 [00:11<00:00, 5.58s/it]
Spatial operations: processing polygon 040111 with era5_land_reanalysis_dev: : 3it [00:00, 3.59it/s]
Temporal operations: processing tp with era5_land_reanalysis_dev: 100%|██████████| 2/2 [00:07<00:00, 3.65s/it]
CPU times: user 1min 46s, sys: 13.8 s, total: 2min
Wall time: 6min 19s
[16]:
xds.data
[16]:
<xarray.Dataset>
Dimensions: (time_agg: 3, spatial_agg: 1, timestep: 1, Station: 3,
time: 26778, source: 2)
Coordinates:
* time_agg (time_agg) object 'nanmax' 'nanmin' 'nansum'
* spatial_agg (spatial_agg) object 'polygon'
* timestep (timestep) object 'D'
* Station (Station) object '023003' '031101' '040111'
* time (time) datetime64[ns] 1950-01-01 1950-01-02 ... 2023-04-25
* source (source) <U29 'era5_land_reanalysis_dev' 'era5_reanalysis_si...
Data variables:
t2m (time_agg, spatial_agg, timestep, Station, time, source) float64 ...
    tp           (time_agg, spatial_agg, timestep, Station, time, source) float64 ...
[17]:
xds.data.squeeze().t2m.hvplot(x='time',
                              by='time_agg',
                              groupby=['source', 'Station'],
                              width=950,
                              height=450,
                              grid=True,
                              widget_location='bottom')
[17]:
The resulting dataset can be explored via the data attribute:
[18]:
xds.data.squeeze().tp \
    .sel(time_agg='nansum') \
    .hvplot(
        x='time',
        groupby=['source', 'Station'],
        width=950,
        height=450,
        color='blue',
        grid=True,
        widget_location='bottom')
[18]:
Bounding box (bbox) around polygons#
The following query seeks the variable tp from the era5_land_reanalysis_dev dataset, covering the period between January 1, 1959, and December 31, 1970, for the bounding box that delimits the three polygons mentioned earlier.
Additional steps are carried out in the process, including converting to the local time zone.
[19]:
query = {
    "datasets": {"era5_land_reanalysis_dev": {'variables': ["tp"]}},
    "space": {
        "clip": "bbox",  # bbox, point or polygon
        "geometry": gdf,
    },
    "time": {
        "start": '1959-01-01',
        "end": '1970-12-31',
        "timezone": 'America/Montreal',
    },
}
xds = xd.Query(**query)
[20]:
xds.data
[20]:
<xarray.Dataset>
Dimensions: (time: 105192, latitude: 30, longitude: 44, source: 1)
Coordinates:
* latitude (latitude) float64 47.8 47.7 47.6 47.5 ... 45.2 45.1 45.0 44.9
* longitude (longitude) float64 -74.9 -74.8 -74.7 -74.6 ... -70.8 -70.7 -70.6
* time (time) datetime64[ns] 1959-01-01 ... 1970-12-31T23:00:00
* source (source) <U24 'era5_land_reanalysis_dev'
Data variables:
tp (time, latitude, longitude, source) float32 0.0 0.0 ... 0.0 0.0
Attributes:
pangeo-forge:inputs_hash: 406b0ec50e81c07d0765733e50a80766cdbc8020364d0d...
pangeo-forge:recipe_hash: 4cee8c6810266a9da93211efd0899553098b572e8ec334...
pangeo-forge:version: 0.9.4
    timezone:                  America/Montreal
Let's find out which day (24-hour period) was the rainiest in the entire region for the data retrieved in the previous cell.
[21]:
indexer = xds.data \
    .sel(source='era5_land_reanalysis_dev') \
    .tp \
    .sum(['latitude', 'longitude']) \
    .rolling(time=24) \
    .sum() \
    .argmax('time') \
    .values
xds.data.isel(time=indexer).time.dt.date.values.tolist()
[21]:
datetime.date(1970, 9, 4)
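The rolling 24-hour sum plus argmax pattern used above can be illustrated on a plain pandas Series with made-up values (a hypothetical toy series, just to show the mechanics):

```python
import pandas as pd

# Toy hourly precipitation: four days, with one very wet 24-hour stretch
rng = pd.date_range("1970-09-01", periods=96, freq="H")
tp = pd.Series(0.0, index=rng)
tp["1970-09-03 00:00":"1970-09-03 23:00"] = 1.0

# Sum precipitation over every sliding 24-hour window, then find the
# position of the window with the largest total
idx = tp.rolling(24).sum().argmax()
print(tp.index[idx].date())  # 1970-09-03
```

The returned position marks the last hour of the wettest 24-hour window, which is why the real query above slices `time=slice(indexer - 24, indexer)` to recover the full day.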
Let's visualize the evolution of the hourly precipitation during that day. Note that each image (raster) delimits exactly the bounding box required to cover all polygons in the query. Please note that full interactivity requires running the code in a Jupyter Notebook.
[22]:
da = xds.data.tp.isel(time=slice(indexer - 24, indexer))
#da = da.where(da>0.0001, drop=True)
(da * 1000) \
    .sel(source='era5_land_reanalysis_dev') \
    .squeeze() \
    .hvplot.quadmesh(
        width=750,
        height=450,
        geo=True,
        tiles='ESRI',
        groupby=["time"],
        legend="top",
        cmap='gist_ncar',
        widget_location="bottom",
        widget_type='scrubber',
        dynamic=False,
        clim=(0.01, 10))
[22]:
Query hydrological datasets#
Hydrological queries are still being tested and the output format is likely to change. Stay tuned!
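In the query below, the id list mixes exact station identifiers with the pattern '02*', which expands to every station whose identifier starts with 02 (hence the 504 stations in the result). Assuming glob-style matching semantics (an assumption on our part, shown here with Python's fnmatch and a hypothetical station list):

```python
from fnmatch import fnmatch

# Hypothetical subset of the available station identifiers
available = ["01010070", "01016500", "01017290", "02492343", "02492360", "03010101"]
requested = ["01010070", "01016500", "01017290", "02*"]

# Keep stations matching any requested identifier or wildcard pattern
matched = [s for s in available if any(fnmatch(s, pat) for pat in requested)]
print(matched)  # every station except '03010101'
```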
[23]:
%%time
query = {
    "datasets": {
        'hydrometric': {
            'variables': ['streamflow', 't2m_nanmax', 't2m_nanmin', 'tp_nansum'],
            'id': ['01010070', '01016500', '01017290', '02*'],
        }
    },
}
xds = xd.Query(**query)
xds.data
CPU times: user 10.3 s, sys: 886 ms, total: 11.2 s
Wall time: 12.9 s
[23]:
<xarray.Dataset>
Dimensions: (id: 504, time: 44990, source: 1)
Coordinates:
* id (id) object '01010070' '01016500' ... '02492343' '02492360'
* time (time) datetime64[ns] 1900-01-01 1900-01-02 ... 2023-03-06
* source (source) <U11 'hydrometric'
Data variables:
streamflow (id, time, source) float64 dask.array<chunksize=(1, 22495, 1), meta=np.ndarray>
t2m_nanmax (id, time, source) float32 dask.array<chunksize=(1, 44990, 1), meta=np.ndarray>
t2m_nanmin (id, time, source) float32 dask.array<chunksize=(1, 44990, 1), meta=np.ndarray>
    tp_nansum   (id, time, source) float32 dask.array<chunksize=(1, 44990, 1), meta=np.ndarray>
[24]:
id1 = pn.widgets.Select(value="01010070", options=list(xds.data.id.values), name="id")
source = pn.widgets.Select(
value="hydrometric", options=list(xds.data.source.values), name="source"
)
@pn.depends(id1, source)
def plot_hydrograph_and_weather(id1, source):
da = xds.data.sel(id=id1, source=source)
dx = da["streamflow"].dropna("time", how="any")
trace1 = da["streamflow"].hvplot(
grid=True,
widget_location="bottom",
color="black",
xlim=(dx.time[0].values, dx.time[-1].values),
title=f"Daily streamflow at location {id1}",
width=850,
height=300,
)
trace2 = da[["t2m_nanmax", "t2m_nanmin"]].hvplot(
grid=True,
widget_location="bottom",
color=["red", "blue"],
xlim=(dx.time[0].values, dx.time[-1].values),
title=f"Daily minimum and maximum temperature at location {id1}",
width=850,
height=300,
)
trace3 = da[["tp_nansum"]].hvplot(
grid=True,
color=["turquoise"],
xlim=(dx.time[0].values, dx.time[-1].values),
title=f"Daily precipitation at location {id1}",
width=850,
height=300,
)
return pn.Column(trace1, trace2, trace3)
pn.Row(
pn.Column("## Hydrometric Data Explorer", id1, source, plot_hydrograph_and_weather)
)
[24]: